Linear Regressions (Refresher)

DDD: Elements of Statistical Machine Learning & Politics of Data

Ayush Patel

At Azim Premji University, Bhopal

27 Jan, 2026

Hello

I am Ayush.

I am a researcher working at the intersection of data, development and economics.

I am an RStudio (Posit) certified tidyverse instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI) at the University of Oxford.

Did you come prepared?

  • You have installed R. If not, see this link.

  • You have installed RStudio/Positron/VSCode or any other IDE. It is recommended that you work through an IDE.

  • You have the libraries {tidyverse} {caret} {ISLR} {ISLR2} {openintro} {broom} installed
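If any of these are missing, they can be installed in one step:

install.packages(c("tidyverse", "caret", "ISLR", "ISLR2", "openintro", "broom"))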

Learning Goals

  1. Apply and interpret OLS for Linear Models
  2. Identify and address problems with Linear Models

Linear Model

  1. Used for prediction and inference when the response is quantitative.
  2. Applied when the relationship between the response and the predictor(s) is assumed to be close to linear.
  3. A linear model with one predictor is referred to as a Simple Linear Model; one with more than one predictor is called a Multiple Linear Model.

Simple Linear Model

  • A perfect linear relationship is unrealistic for any natural process.
  • So, it is not possible to predict the exact value of \(y\) just by knowing \(x\).
  • Ex: Family income (\(x\)) and financial support (\(y\)) to a student by a college.
  • This doesn’t mean one cannot make a reasonably good estimate of \(y\) using \(x\).

Simple Linear Model

The relationship between \(x\) and \(y\) can be modeled as a straight line with some error:

\(y = b_0 + b_1x + \epsilon\)

\(b_0\) is the intercept and \(b_1\) is the slope of the line; the error term is represented by \(\epsilon\).

Possums

site pop sex age head_l skull_w total_l tail_l
1 Vic m 8 94.1 60.4 89.0 36.0
1 Vic f 6 92.5 57.6 91.5 36.5
1 Vic f 6 94.0 60.0 95.5 39.0
1 Vic f 6 93.2 57.1 92.0 38.0
1 Vic f 2 91.5 56.3 85.5 36.0
1 Vic f 1 93.1 54.8 90.5 35.5
1 Vic m 2 95.3 58.2 89.5 36.0
1 Vic f 6 94.8 57.6 91.0 37.0
1 Vic f 9 93.4 56.3 91.5 37.0
1 Vic f 6 91.8 58.0 89.5 37.5
1 Vic f 9 93.3 57.2 89.5 39.0
1 Vic f 5 94.9 55.6 92.0 35.5
1 Vic m 5 95.1 59.9 89.5 36.0
1 Vic m 3 95.4 57.6 91.5 36.0
1 Vic m 5 92.9 57.6 85.5 34.0
1 Vic m 4 91.6 56.0 86.0 34.5
1 Vic f 1 94.7 67.7 89.5 36.5
1 Vic m 2 93.5 55.7 90.0 36.0
1 Vic f 5 94.4 55.4 90.5 35.0
1 Vic f 4 94.8 56.3 89.0 38.0
1 Vic f 3 95.9 58.1 96.5 39.5
1 Vic m 3 96.3 58.5 91.0 39.5
1 Vic f 4 92.5 56.1 89.0 36.0
1 Vic m 2 94.4 54.9 84.0 34.0
1 Vic m 3 95.8 58.5 91.5 35.5
1 Vic m 7 96.0 59.0 90.0 36.0
1 Vic f 2 90.5 54.5 85.0 35.0
1 Vic m 4 93.8 56.8 87.0 34.5
1 Vic f 3 92.8 56.0 88.0 35.0
1 Vic f 2 92.1 54.4 84.0 33.5
1 Vic m 3 92.8 54.1 93.0 37.0
1 Vic f 4 94.3 56.7 94.0 39.0
1 Vic m 3 91.4 54.6 89.0 37.0
2 Vic m 2 90.6 55.7 85.5 36.5
2 Vic m 4 94.4 57.9 85.0 35.5
2 Vic m 7 93.3 59.3 88.0 35.0
2 Vic f 2 89.3 54.8 82.5 35.0
2 Vic m 7 92.4 56.0 80.5 35.5
2 Vic f 1 84.7 51.5 75.0 34.0
2 Vic f 3 91.0 55.0 84.5 36.0
2 Vic f 5 88.4 57.0 83.0 36.5
2 Vic m 3 85.3 54.1 77.0 32.0
2 Vic f 2 90.0 55.5 81.0 32.0
2 Vic m NA 85.1 51.5 76.0 35.5
2 Vic m 3 90.7 55.9 81.0 34.0
2 Vic m NA 91.4 54.4 84.0 35.0
3 other m 2 90.1 54.8 89.0 37.5
3 other m 5 98.6 63.2 85.0 34.0
3 other m 4 95.4 59.2 85.0 37.0
3 other f 5 91.6 56.4 88.0 38.0
3 other f 5 95.6 59.6 85.0 36.0
3 other m 6 97.6 61.0 93.5 40.0
3 other f 3 93.1 58.1 91.0 38.0
4 other m 7 96.9 63.0 91.5 43.0
4 other m 2 103.1 63.2 92.5 38.0
4 other m 3 99.9 61.5 93.7 38.0
4 other f 4 95.1 59.4 93.0 41.0
4 other m 3 94.5 64.2 91.0 39.0
4 other m 2 102.5 62.8 96.0 40.0
4 other f 2 91.3 57.7 88.0 39.0
5 other m 7 95.7 59.0 86.0 38.0
5 other f 3 91.3 58.0 90.5 39.0
5 other f 6 92.0 56.4 88.5 38.0
5 other f 3 96.9 56.5 89.5 38.5
5 other f 5 93.5 57.4 88.5 38.0
5 other f 3 90.4 55.8 86.0 36.5
5 other m 4 93.3 57.6 85.0 36.5
5 other m 5 94.1 56.0 88.5 38.0
5 other m 5 98.0 55.6 88.0 37.5
5 other f 7 91.9 56.4 87.0 38.0
5 other m 6 92.8 57.6 90.0 40.0
5 other m 1 85.9 52.4 80.5 35.0
5 other m 1 82.5 52.3 82.0 36.5
6 other f 4 88.7 52.0 83.0 38.0
6 other m 6 93.8 58.1 89.0 38.0
6 other m 5 92.4 56.8 89.0 41.0
6 other m 6 93.6 56.2 84.0 36.0
6 other m 1 86.5 51.0 81.0 36.5
6 other m 1 85.8 50.0 81.0 36.5
6 other m 1 86.7 52.6 84.0 38.0
6 other m 3 90.6 56.0 85.5 38.0
6 other f 4 86.0 54.0 82.0 36.5
6 other f 3 90.0 53.8 81.5 36.0
6 other m 3 88.4 54.6 80.5 36.0
6 other m 3 89.5 56.2 92.0 40.5
6 other f 3 88.2 53.2 86.5 38.5
7 other m 2 98.5 60.7 93.0 41.5
7 other f 2 89.6 58.0 87.5 38.0
7 other m 6 97.7 58.4 84.5 35.0
7 other m 3 92.6 54.6 85.0 38.5
7 other m 3 97.8 59.6 89.0 38.0
7 other m 2 90.7 56.3 85.0 37.0
7 other m 3 89.2 54.0 82.0 38.0
7 other m 7 91.8 57.6 84.0 35.5
7 other m 4 91.6 56.6 88.5 37.5
7 other m 4 94.8 55.7 83.0 38.0
7 other m 3 91.0 53.1 86.0 38.0
7 other m 5 93.2 68.6 84.0 35.0
7 other f 3 93.3 56.2 86.5 38.5
7 other m 1 89.5 56.0 81.5 36.5
7 other m 1 88.6 54.7 82.5 39.0
7 other f 6 92.4 55.0 89.0 38.0
7 other m 4 91.5 55.2 82.5 36.5
7 other f 3 93.6 59.9 89.0 40.0
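These are the possum data that ship with {openintro}; a minimal sketch to load and inspect them:

library(openintro)       # provides the possum data set
dplyr::glimpse(possum)   # 104 possums with morphometric measurements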

Possums

The equation of the line we have is:

\(\hat{y} = 42.7 + 0.573x\)

For possums of total length 85 cm, we estimate the average head length to be:

\(\hat{y} = 42.7 + 0.573*(85) = 91.405\)
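A sketch of how these estimates are obtained in R (assuming the possum data from {openintro} are loaded):

fit <- lm(head_l ~ total_l, data = possum)
coef(fit)                                         # intercept ~42.7, slope ~0.573
predict(fit, newdata = data.frame(total_l = 85))  # predicted head length, ~91.4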

Residuals
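
A sketch of how the table below can be produced, using broom::augment() on the fitted model:

lm(head_l ~ total_l, data = possum) |>
    broom::augment() |>
    dplyr::select(head_l, .fitted, .resid)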

head_l .fitted .resid
94.1 93.69801 0.4019925
92.5 95.13026 -2.6302607
94.0 97.42187 -3.4218658
93.2 95.41671 -2.2167113
91.5 91.69285 -0.1928530
93.1 94.55736 -1.4573594
95.3 93.98446 1.3155419
94.8 94.84381 -0.0438100
93.4 95.13026 -1.7302607
91.8 93.98446 -2.1844581
93.3 93.98446 -0.6844581
94.9 95.41671 -0.5167113
95.1 93.98446 1.1155419
95.4 95.13026 0.2697393
92.9 91.69285 1.2071470
91.6 91.97930 -0.3793036
94.7 93.98446 0.7155419
93.5 94.27091 -0.7709087
94.4 94.55736 -0.1573594
94.8 93.69801 1.1019925
95.9 97.99477 -2.0947671
96.3 94.84381 1.4561900
92.5 93.69801 -1.1980075
94.4 90.83350 3.5664990
95.8 95.13026 0.6697393
96.0 94.27091 1.7290913
90.5 91.40640 -0.9064023
93.8 92.55220 1.2477951
92.8 93.12511 -0.3251062
92.1 90.83350 1.2664990
92.8 95.98961 -3.1896126
94.3 96.56251 -2.2625139
91.4 93.69801 -2.2980075
90.6 91.69285 -1.0928530
94.4 91.40640 2.9935977
93.3 93.12511 0.1748938
89.3 89.97415 -0.6741491
92.4 88.82835 3.5716535
84.7 85.67739 -0.9773895
91.0 91.11995 -0.1199517
88.4 90.26060 -1.8605997
85.3 86.82319 -1.5231920
90.0 89.11480 0.8852028
85.1 86.25029 -1.1502908
90.7 89.11480 1.5852028
91.4 90.83350 0.5664990
90.1 93.69801 -3.5980075
98.6 91.40640 7.1935977
95.4 91.40640 3.9935977
91.6 93.12511 -1.5251062
95.6 91.40640 4.1935977
97.6 96.27606 1.3239368
93.1 94.84381 -1.7438100
96.9 95.13026 1.7697393
103.1 95.70316 7.3968380
99.9 96.39064 3.5093565
95.1 95.98961 -0.8896126
94.5 94.84381 -0.3438100
102.5 97.70832 4.7916836
91.3 93.12511 -1.8251062
95.7 91.97930 3.7206964
91.3 94.55736 -3.2573594
92.0 93.41156 -1.4115568
96.9 93.98446 2.9155419
93.5 93.41156 0.0884432
90.4 91.97930 -1.5793036
93.3 91.40640 1.8935977
94.1 93.41156 0.6884432
98.0 93.12511 4.8748938
91.9 92.55220 -0.6522049
92.8 94.27091 -1.4709087
85.9 88.82835 -2.9283465
82.5 89.68770 -7.1876985
88.7 90.26060 -1.5605997
93.8 93.69801 0.1019925
92.4 93.69801 -1.2980075
93.6 90.83350 2.7664990
86.5 89.11480 -2.6147972
85.8 89.11480 -3.3147972
86.7 90.83350 -4.1335010
90.6 91.69285 -1.0928530
86.0 89.68770 -3.6876985
90.0 89.40125 0.5987522
88.4 88.82835 -0.4283465
89.5 95.41671 -5.9167113
88.2 92.26575 -4.0657542
98.5 95.98961 2.5103874
89.6 92.83866 -3.2386555
97.7 91.11995 6.5800483
92.6 91.40640 1.1935977
97.8 93.69801 4.1019925
90.7 91.40640 -0.7064023
89.2 89.68770 -0.4876985
91.8 90.83350 0.9664990
91.6 93.41156 -1.8115568
94.8 90.26060 4.5394003
91.0 91.97930 -0.9793036
93.2 90.83350 2.3664990
93.3 92.26575 1.0342458
89.5 89.40125 0.0987522
88.6 89.97415 -1.3741491
92.4 93.69801 -1.2980075
91.5 89.97415 1.5258509
93.6 93.69801 -0.0980075

Residual

Formally, for the \(i^{th}\) observation, the residual is

\(e_i = y_i - \hat{y}_i\)

DIY-1 {2 mins}

Use the linear model \(\hat{y} = 41 + 0.59x\) to compute the residual for the observation (76.0, 85.1).

DIY-2

If a model underestimates an observation, will the residual be positive or negative?

Residuals

Least Squares

Provides an objective criterion for finding the best line:

the line with the smallest sum of squared residuals,

\(e_1^2 + e_2^2 + ... + e_n^2\)
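A minimal sketch of this idea: numerically search for the \((b_0, b_1)\) that minimize the sum of squared residuals, and compare with lm(), which solves the same least-squares problem directly:

# Sum of squared residuals as a function of the candidate line (b0, b1)
sse <- function(b) sum((possum$head_l - (b[1] + b[2] * possum$total_l))^2)
optim(c(0, 0), sse, method = "BFGS")$par   # numerical minimizer
coef(lm(head_l ~ total_l, data = possum))  # least-squares estimates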

Interpretation

For a model:

\(\hat{y} = \beta_0 + \beta_1 x\)

The slope (\(\beta_1\)) describes the estimated difference in the predicted average outcome of \(y\) when the predictor variable \(x\) is one unit larger.

The intercept (\(\beta_0\)) describes the predicted average outcome of \(y\) when \(x = 0\).

Extrapolation

family_income gift_aid price_paid
92.922 21.720 14.280
0.250 27.470 8.530
53.092 27.750 14.250
50.200 27.220 8.780
137.613 18.000 24.000
47.957 18.520 23.480
113.534 13.000 23.000
168.579 13.000 29.000
208.115 14.000 28.000
12.523 25.470 16.530
119.822 21.000 15.000
50.563 17.476 18.524
16.120 22.470 13.530
206.932 11.000 25.000
68.678 25.720 16.280
73.598 32.720 9.280
218.120 23.000 19.000
89.983 16.000 20.000
271.974 20.000 22.000
118.165 24.000 18.000
108.395 15.500 26.500
235.522 7.000 35.000
78.926 20.000 16.000
76.854 23.520 18.480
98.496 14.000 22.000
134.586 10.000 32.000
75.157 21.120 20.880
135.857 21.000 21.000
79.448 27.500 14.500
80.858 20.550 15.450
86.140 14.300 27.700
40.490 18.320 23.680
143.337 18.000 24.000
97.664 10.000 26.000
74.713 21.000 15.000
178.795 13.600 22.400
71.550 20.470 15.530
92.605 21.000 15.000
62.546 21.600 14.400
0.000 27.470 14.530
159.981 25.814 10.186
40.397 25.970 10.030
85.203 25.558 16.442
27.164 20.470 21.530
146.397 17.000 25.000
14.089 20.420 21.580
217.443 20.000 22.000
140.093 15.000 21.000
104.147 17.560 24.440
83.333 23.500 18.500

Extrapolation

# Regression to estimate gift aid received using family income;
# broom::augment() generates a data frame of predicted and residual
# values along with other estimates.
lm(gift_aid ~ family_income,
    data = elmhurst) |>
    broom::augment() |>
    ggplot(aes(x = family_income, y = gift_aid)) +
    geom_point(colour = "steelblue") +
    geom_smooth(method = "lm", se = FALSE, linetype = 2) +
    geom_segment(aes(xend = family_income, yend = .fitted),
                 colour = "red", alpha = 0.5) +
    labs(
        x = "Family Income in 1000 USD",
        y = "Gift Aid in 1000 USD"
    ) +
    theme_minimal()

Extrapolation

lm(gift_aid ~ family_income,
    data = elmhurst) |>                                   
    broom::tidy() 
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    24.3       1.29       18.8  8.28e-24
2 family_income  -0.0431    0.0108     -3.98 2.29e- 4

Extrapolation

What will be the gift aid received by a student whose family income is $1 million?

\(GiftAid = 24.3 - 0.0431 \times FamilyIncome\)



Family income is measured in thousands of dollars, so $1 million corresponds to \(FamilyIncome = 1000\):

\(GiftAid = 24.3 - 0.0431 \times 1000 = -18.8\)

Does this mean the student will be penalised $18,800?
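The same extrapolation in code (recall family_income is recorded in thousands of dollars):

fit <- lm(gift_aid ~ family_income, data = elmhurst)
# $1 million of family income corresponds to family_income = 1000
predict(fit, newdata = data.frame(family_income = 1000))  # about -18.8 (thousand USD)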

Model fit

\(R^2\) is used to describe the strength of the model fit.

It is the proportion of variance in the response explained by the model:

\(R^2 = \frac{Var(Gift Aid) - Var(Residuals)}{Var(Gift Aid)}\)
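A sketch verifying this on the elmhurst model (for an OLS fit with an intercept, the two computations agree):

fit <- lm(gift_aid ~ family_income, data = elmhurst)
summary(fit)$r.squared                        # built-in R-squared
1 - var(resid(fit)) / var(elmhurst$gift_aid)  # variance-based definition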

Outliers - Intuitive

[Figure O1, from IMS 2e]

Outliers - Intuitive

[Figure O2, from IMS 2e]

Outliers - Intuitive

  1. Analyse with and without the outliers. Are the results different? Think about why.
  2. Present the differences for discussion.
  3. Do not remove outliers without good reason.

High leverage Points

“Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with high leverage or leverage points.”



If such points do affect the slope of the line, we call them Influential points.
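Leverage and influence can be checked numerically: hatvalues() gives each observation's leverage and cooks.distance() its influence. A minimal sketch on the simple possum model:

fit <- lm(head_l ~ total_l, data = possum)
head(sort(hatvalues(fit), decreasing = TRUE))       # highest-leverage points
head(sort(cooks.distance(fit), decreasing = TRUE))  # most influential points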

How are these model fits different?

[Figure ex1, from IMS]

How would the model fit look for these residual plots?

[Figure ex2, from IMS]

Multiple Linear Model

Used when many variables are associated with the response at once.

Why do we need Multiple Linear Model

Why can’t I run several simple linear models?

  1. How would you make a single prediction from many simple linear models?
  2. Each simple linear model would ignore other factors that are associated with the response.

Multiple Linear Model

lm(head_l ~ total_l + sex + age, data = possum) |>
    broom::tidy() |>
    kableExtra::kable()

term         estimate    std.error  statistic  p.value
(Intercept)  41.8478037  5.2439842  7.980154   0.0000000
total_l       0.5587474  0.0607432  9.198523   0.0000000
sexm          1.6626893  0.4970961  3.344804   0.0011679
age           0.2975386  0.1325199  2.245237   0.0270001

Are all the predictor coefficients zero?

F-statistic to the rescue

\(F = \frac{(TSS -RSS)/p}{RSS/(n-p-1)}\)

But what is large enough?
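In R, the F-statistic and its degrees of freedom can be read off the model summary; "large enough" is judged by the tail probability under the corresponding F distribution. A minimal sketch on the multiple model above:

fit <- lm(head_l ~ total_l + sex + age, data = possum)
f <- summary(fit)$fstatistic              # F value, numerator df, denominator df
f
pf(f[1], f[2], f[3], lower.tail = FALSE)  # p-value for H0: all slopes are zero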

Variable selection

  1. Forward Selection
  2. Backward Selection
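Base R's step() automates both directions, although it compares candidate models by AIC rather than adjusted \(R^2\). A minimal sketch, using complete cases so the sample stays fixed during the search:

full <- lm(head_l ~ total_l + sex + age, data = na.omit(possum))
step(full, direction = "backward")   # drop terms while AIC improves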

Adjusted \(R^2\)

\(R_{adj} ^2 = 1 - \frac{s_{residual}^2/(n-k-1)}{s_{outcome}^2/(n-1)}\)



Why not just use \(R^2\)?

“The adjusted R-squared adjusts for the number of terms in the model. Importantly, its value increases only when the new term improves the model fit more than expected by chance alone. The adjusted R-squared value actually decreases when the term doesn’t improve the model fit by a sufficient amount.”
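Both quantities are reported by summary(); a quick sketch on the multiple possum model:

full <- lm(head_l ~ total_l + sex + age, data = possum)
summary(full)$r.squared       # never decreases when a term is added
summary(full)$adj.r.squared   # penalised for the number of terms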

Prediction

  1. \(\beta_0, \beta_1,...,\beta_p\) are estimates of population parameters. This relates to the reducible error we talked about in the bias-variance tradeoff. To address this uncertainty we use confidence intervals. This is how close \(\hat{Y}\) is to \(f(X)\).

  2. Even if we had perfect estimates of the parameters, we would still have to deal with the irreducible error (\(\epsilon\)) that is hidden in every realization of \(Y\). To indicate this we use prediction intervals. This is how much \(Y\) varies from \(\hat{Y}\).

Prediction

 predict(lm(head_l ~ total_l, possum), 
 newdata = tibble(total_l = 85), 
 interval = "confidence")
      fit      lwr      upr
1 91.4064 90.84497 91.96783
 predict(lm(head_l ~ total_l, possum), 
 newdata = tibble(total_l = 85), 
 interval = "prediction")
      fit      lwr      upr
1 91.4064 86.22807 96.58474

Problems - Non-linear response-predictor relationship

Heteroscedasticity - non-constant variance of error terms

[Figure from ISLR]
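A residuals-versus-fitted plot is the usual first check for both of these problems. A minimal sketch using base R's built-in diagnostic plot on the simple possum model:

fit <- lm(head_l ~ total_l, data = possum)
# which = 1 plots residuals against fitted values: look for curvature
# (non-linearity) and funnel shapes (non-constant error variance)
plot(fit, which = 1)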

Exercise

  • Use the duke_forest data from {openintro}
  • Your goal is to model the price of the house
  • Begin by reading the data documentation
  • Carry out necessary exploratory analyses
  • Come up with a model (a starter sketch follows this list)
  • Implement it
  • Carry out diagnostics
  • Tune the model if needed
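A possible starting point, as referenced in the list above (a sketch only; the predictor names used here, area, bed, and bath, are assumptions to be checked against the data documentation):

library(openintro)
?duke_forest                 # read the data documentation
dplyr::glimpse(duke_forest)  # exploratory look at the variables
fit <- lm(price ~ area + bed + bath, data = duke_forest)  # one candidate model
summary(fit)                 # coefficients and fit statistics
plot(fit)                    # residual diagnostics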

Readings

Introduction to Modern Statistics, Chapters 7 and 8.